1 Question 1: Reading the Gapminder Data into R

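# Assumes the tidyverse is attached and Gapminder_Filelink is defined in an earlier setup chunk.
suppressWarnings({

  # read_csv() already returns a tibble, so no as_tibble() call is needed.
  # (show_col_types is an argument of read_csv(), not as_tibble(); it is left
  # unset here so the column specification message below is still printed.)
  GapminderData <- read_csv(file = Gapminder_Filelink) %>%
    select(-`...1`) # drop the unnamed row-index column

})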
## New names:
## Rows: 2607 Columns: 20
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): Country Name, continent dbl (18): ...1, Year, Agriculture, value added (%
## of GDP), CO2 emissions (me...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

This is the Gapminder dataset. Although it is labelled as cleaned, it still contains issues such as missing values and an unnamed index column. The dataset records various metrics, from economic to agricultural, for individual countries over time.
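As a minimal sketch of how the unnamed index column arises and is removed, the snippet below reads a tiny in-memory CSV. The CSV text and the object names `csv_text`, `demo`, and `demo_clean` are made up for illustration and are not part of the Gapminder file.

```r
library(readr)
library(dplyr)

# A tiny CSV whose first header cell is empty, mimicking an exported row index.
# The values are invented for illustration only.
csv_text <- ",Country Name,Year\n1,Aruba,1962\n2,Aruba,1967\n"

# I() marks the string as literal data rather than a file path (readr >= 2.0).
# show_col_types = FALSE silences the column specification message.
demo <- read_csv(I(csv_text), show_col_types = FALSE)
names(demo) # the unnamed column is repaired to `...1`

# Drop the repaired index column, as in the chunk above.
demo_clean <- demo %>% select(-`...1`)
```

Passing `show_col_types = FALSE` directly to `read_csv()` is what quiets the column specification message; the "New names" note about `...1` is still printed.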

1.1 Question 2: Filter Gapminder dataset by year of 1962 and make a scatter plot

filtered_year <- 1962
GapminderFilteredYear <- GapminderData %>%
  drop_na(gdpPercap, `CO2 emissions (metric tons per capita)`) %>%
  # Use log10() so the values match the column names (log() is the natural log)
  mutate(
    `log10(gdpPercap)` = log10(gdpPercap),
    `log10(CO2 emissions (metric tons per capita))` = log10(`CO2 emissions (metric tons per capita)`)
  ) %>%
  dplyr::filter(Year == filtered_year)

2 Question 2 (cont.): Make a scatter plot of the filtered dataset based on CO2 emissions and gdpPercap

ScatterPlot <- GapminderFilteredYear  %>%
  ggplot(., aes(
    x = `log10(gdpPercap)`,
    y = `log10(CO2 emissions (metric tons per capita))`
  )) +
  geom_point() +
  ggtitle("Relationship between CO2 emissions and gdpPerCap in 1962") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

The scatter plot shows a positive, roughly linear relationship between log-transformed CO2 emissions and log-transformed GDP per capita. Now let's measure how strong that relationship is using the Pearson correlation coefficient (R).

2.1 Question 3: On the filtered data, calculate the pearson correlation of ‘CO2 emissions (metric tons per capita)’ and gdpPercap. What is the Pearson R value and associated p value?

test <- "pearson"
rm_na <- "complete.obs"
pearson_corr <- cor(GapminderFilteredYear$`log10(CO2 emissions (metric tons per capita))`,
  GapminderFilteredYear$`log10(gdpPercap)`,
  method = test,
  use = rm_na
) * 100

## The pearson correlation coefficient between CO2 emissions and GDP per capita is 86%

The Pearson correlation coefficient is approximately 0.86 (86%), indicating a strong positive correlation between log-transformed CO2 emissions and log-transformed GDP per capita across countries in 1962. In addition, the p-value (2.2 * 10^-6) is less than 0.05, so the correlation is statistically significant. Now let's look at all years and see which one has the highest Pearson correlation coefficient.
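Since `cor()` returns only the coefficient, one way to obtain the coefficient and its p-value together is `cor.test()`. The sketch below uses simulated vectors (`log_gdp`, `log_co2`) as stand-ins for the log-transformed columns; the numbers are not Gapminder values and will not reproduce the figures quoted above.

```r
# Simulated stand-ins for log10(gdpPercap) and log10(CO2 emissions); not Gapminder values.
set.seed(1)
log_gdp <- rnorm(50, mean = 3.5, sd = 0.5)
log_co2 <- 0.9 * log_gdp + rnorm(50, sd = 0.3)

# cor.test() returns the Pearson estimate together with its p-value.
pearson_test <- cor.test(log_co2, log_gdp, method = "pearson")
r_value <- unname(pearson_test$estimate)
p_value <- pearson_test$p.value

# The estimate matches what cor() returns on the same vectors.
stopifnot(isTRUE(all.equal(r_value, cor(log_co2, log_gdp))))

# The log base does not affect Pearson's r: log10(x) is ln(x) divided by a constant.
gdp <- 10^log_gdp
stopifnot(isTRUE(all.equal(cor(log(gdp), log_co2), cor(log10(gdp), log_co2))))
```

On the real data, the equivalent call would pass the two `log10(...)` columns of `GapminderFilteredYear` to `cor.test()`.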

2.2 Question 4: On the unfiltered data, answer “In what year is the correlation between ‘CO2 emissions (metric tons per capita)’ and gdpPercap the strongest?” Filter the dataset to that year for the next step…

test <- "pearson"
rm_na <- "complete.obs"
# Unique years in the dataset, used both for iteration and as names for the results
GapminderYear <- GapminderData %>%
  select(Year) %>%
  unique() %>%
  pull() %>%
  as.numeric()

# Keep only the columns needed for the per-year correlations.
# select() is the right verb here; summarise() is meant to return one row per group.
Gapminder_summarize_df <- select(
  GapminderData,
  Year, gdpPercap,
  `CO2 emissions (metric tons per capita)`
)

PearsonCorrYears <- GapminderYear %>% # Make into a list by iterating through the years
  sapply(.,
    USE.NAMES = TRUE,
    simplify = FALSE,
    function(year) {
      
      cor(
        x = Gapminder_summarize_df %>%
          dplyr::filter(Year == year) %>%
          dplyr::select(`CO2 emissions (metric tons per capita)`) %>%
          dplyr::pull(),
        y = Gapminder_summarize_df %>%
          dplyr::filter(Year == year) %>%
          dplyr::select(gdpPercap) %>%
          dplyr::pull(),
        method = test,
        use = rm_na
      )
    }
  ) %>% unlist()

#Adding the names for the correlation values
names(PearsonCorrYears) <- GapminderYear

After iterating over the years in the Gapminder dataset, the highest Pearson correlation coefficient occurs in 1967 (R of about 0.9388, i.e. 93.88%), so that is the year with the strongest correlation between CO2 emissions and GDP per capita. Now let's filter the Gapminder dataset to that year and draw an interactive scatter plot with plotly.
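The per-year iteration can also be written as a grouped summary. This is a sketch on a made-up two-year table (`toy`, with hypothetical columns `gdp` and `co2`), not the author's implementation. It shows the same idea: one correlation per year, then the year with the largest value.

```r
library(dplyr)

# Made-up data: the 1967 rows are perfectly linear, the 1962 rows are not.
toy <- tibble(
  Year = rep(c(1962, 1967), each = 5),
  gdp  = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  co2  = c(2, 1, 4, 3, 5, 2, 4, 6, 8, 10)
)

# One Pearson correlation per year.
by_year <- toy %>%
  group_by(Year) %>%
  summarise(r = cor(gdp, co2, use = "complete.obs"), .groups = "drop")

# Year with the strongest correlation.
best <- by_year %>% slice_max(r, n = 1)
best
```

Here `summarise()` returns exactly one row per group, which is the usage it is designed for.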

2.3 Question 5: Using plotly, create an interactive scatter plot comparing ‘CO2 emissions (metric tons per capita)’ and gdpPercap, where the point size is determined by pop (population) and the color is determined by the continent. You can easily convert any ggplot plot to a plotly plot using the ggplotly() command.

PearsonCorrMaxYear <- PearsonCorrYears[which.max(PearsonCorrYears)] %>%
  names() %>%
  as.numeric() # Finding the max year for the analysis

GapminderFilteredMax <- GapminderData %>% ## Filter by year with the highest Pearson
  filter(Year == PearsonCorrMaxYear) %>% ## correlation coefficient of CO2 and GDP
  mutate(`log10(CO2 emissions (metric tons per capita))` = log10(`CO2 emissions (metric tons per capita)`),
         `log10(gdpPercap)` = log10(gdpPercap))

GapminderMaxplot <- GapminderFilteredMax %>% ## ggplot implementation
  ggplot(., aes(
    x = `log10(CO2 emissions (metric tons per capita))`,
    y = `log10(gdpPercap)`,
    size = pop,
    color = continent
  )) +
  geom_point() + 
  ggtitle("Relationship between Log Transformed gdpPercap & CO2 emissions") +
  theme_classic()

Here is the plotly implementation for the Gapminder data in 1967, the year with the highest correlation. The scatter plot is interactive: hovering over a point shows its values (gdpPercap, CO2 emissions).
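The conversion step itself is a single call to `ggplotly()`. The sketch below uses a small invented data frame (`toy`, with hypothetical column names) in place of `GapminderFilteredMax` so that it runs on its own.

```r
library(ggplot2)
library(plotly)

# Made-up values standing in for one year of Gapminder data.
toy <- data.frame(
  log_co2 = c(-1.0, 0.2, 0.8, 1.1),
  log_gdp = c(2.6, 3.3, 3.9, 4.2),
  pop = c(2e6, 5e7, 1e7, 8e7),
  continent = c("Africa", "Asia", "Americas", "Europe")
)

static_plot <- ggplot(toy, aes(x = log_co2, y = log_gdp, size = pop, color = continent)) +
  geom_point() +
  theme_classic()

# ggplotly() converts the ggplot object into an interactive htmlwidget.
# Printing interactive_plot in an R Markdown chunk renders the widget.
interactive_plot <- ggplotly(static_plot)
```

With the objects defined in this report, the corresponding call would presumably be `ggplotly(GapminderMaxplot)`.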

2.3.1 New Question 1: What is the relationship between continent and ‘Energy use (kg of oil equivalent per capita)’? (stats test needed)

GapminderContinentEnergyUse <- GapminderData %>%
  dplyr::mutate(`log(Energy use (kg of oil equivalent per capita))` = log10(`Energy use (kg of oil equivalent per capita)`)) %>%
  select(continent, `log(Energy use (kg of oil equivalent per capita))`) %>%
  na.omit()

To determine the relationship between continent (a categorical variable) and Energy use (kg of oil equivalent per capita) (a continuous variable), we fit a linear model to the data; its overall F-test is equivalent to a one-way ANOVA across the continent groups.

lm(
  formula = `log(Energy use (kg of oil equivalent per capita))` ~ continent,
  data = GapminderContinentEnergyUse
) %>% summary()
## 
## Call:
## lm(formula = `log(Energy use (kg of oil equivalent per capita))` ~ 
##     continent, data = GapminderContinentEnergyUse)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.73812 -0.21810 -0.03864  0.17894  1.16710 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.72558    0.02627 103.746  < 2e-16 ***
## continentAmericas  0.27191    0.03769   7.214 1.21e-12 ***
## continentAsia      0.23187    0.03785   6.126 1.38e-09 ***
## continentEurope    0.70072    0.03502  20.007  < 2e-16 ***
## continentOceania   0.85552    0.08694   9.841  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3706 on 843 degrees of freedom
## Multiple R-squared:  0.3583, Adjusted R-squared:  0.3553 
## F-statistic: 117.7 on 4 and 843 DF,  p-value: < 2.2e-16

From the results above, we have the following equation:

Y = Intercept + continentAmericas * x1 + continentAsia * x2 + continentEurope * x3 + continentOceania * x4

Definitions of the variables:
Y = log10 of Energy use (kg of oil equivalent per capita) (continuous variable)
x1 = Americas, i.e. North and South America (indicator variable: 0 or 1)
x2 = Asia (indicator variable: 0 or 1)
x3 = Europe (indicator variable: 0 or 1)
x4 = Oceania (indicator variable: 0 or 1)
continentAmericas = 0.27191
continentAsia = 0.23187
continentEurope = 0.70072
continentOceania = 0.85552
Intercept = 2.72558 (mean of Y for the reference continent, the one level with no coefficient of its own)

Since the p-value for every continent coefficient is less than 0.05, we can reject the null hypothesis for each one: the mean log10 energy use of each listed continent differs significantly from that of the reference continent. The overall F-test (p < 2.2e-16) likewise indicates that energy use differs across continents. Furthermore, since all the continent coefficients are positive, each listed continent has a higher mean log10 Energy use (kg of oil equivalent per capita) than the reference continent.
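To make the coefficients concrete, the following sketch turns them into predicted group means. The coefficient values are copied from the `summary()` output above; the object names and the label `Reference` for the baseline continent are illustrative choices.

```r
# Coefficients copied from the summary() output above.
intercept <- 2.72558
offsets <- c(Americas = 0.27191, Asia = 0.23187, Europe = 0.70072, Oceania = 0.85552)

# Predicted mean of log10(energy use): the intercept for the reference continent,
# intercept + offset for every other continent.
predicted_log10 <- c(Reference = intercept, intercept + offsets)

# Back-transform to kg of oil equivalent per capita. Because the model was fit on
# the log scale, these are geometric means rather than arithmetic means.
predicted_kg <- 10^predicted_log10

round(predicted_log10, 3)
round(predicted_kg)
```

For example, the predicted value for Europe on the log10 scale is 2.72558 + 0.70072 = 3.4263.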

2.3.2 New Question 2: Is there a significant difference between Europe and Asia with respect to ‘Imports of goods and services (% of GDP)’ in the years after 1990? (stats test needed)

continents_of_interest <- c("Europe", "Asia")
variables_of_interest <- c("continent", "Year", "Imports of goods and services (% of GDP)")

Euro_Asia_Imports <- GapminderData %>%
  dplyr::filter(continent %in% continents_of_interest &
    Year > 1990) %>%
  dplyr::select(all_of(variables_of_interest)) %>% # all_of() avoids the tidyselect deprecation warning
  na.omit()

ggboxplot(Euro_Asia_Imports,
  x = "continent",
  y = "`Imports of goods and services (% of GDP)`",
  color = "continent",
  add = "jitter",
  shape = "continent"
)

To determine whether there is a significant difference between Europe and Asia in Imports of goods and services (% of GDP) after 1990, we first need to choose between Student's t-test, which assumes equal variances in the two groups, and the Welch t-test, which does not.

The boxplot above shows that the variation (i.e. spread) of the data differs somewhat between Asia and Europe, so the Welch t-test is the more appropriate choice.

t.test(`Imports of goods and services (% of GDP)` ~ continent,
  data = Euro_Asia_Imports
)
## 
##  Welch Two Sample t-test
## 
## data:  Imports of goods and services (% of GDP) by continent
## t = 1.3552, df = 137.53, p-value = 0.1776
## alternative hypothesis: true difference in means between group Asia and group Europe is not equal to 0
## 95 percent confidence interval:
##  -2.321099 12.433240
## sample estimates:
##   mean in group Asia mean in group Europe 
##             46.84531             41.78924

Since the p-value (0.1776) is greater than 0.05, we fail to reject the null hypothesis. The 95% confidence interval for the difference in means (-2.32 to 12.43) also contains 0. We conclude that there is no statistically significant difference between Europe and Asia with respect to ‘Imports of goods and services (% of GDP)’ in the years after 1990.
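In base R the choice between the two tests comes down to the `var.equal` argument of `t.test()`, which defaults to `FALSE` (Welch). The sketch below illustrates this on two made-up vectors (`asia`, `europe`); the values are invented and unrelated to the results above.

```r
# Made-up import shares (% of GDP); the second group is much less spread out.
asia   <- c(30, 55, 42, 70, 25, 60)
europe <- c(40, 42, 41, 43, 39, 44)

# Default: Welch's test, which does not assume equal variances.
welch <- t.test(asia, europe)

# var.equal = TRUE switches to Student's pooled-variance test.
student <- t.test(asia, europe, var.equal = TRUE)

welch$method
student$parameter # degrees of freedom: n1 + n2 - 2
welch$parameter   # Welch-Satterthwaite degrees of freedom
```

When the group variances differ, the Welch degrees of freedom drop below n1 + n2 - 2, which makes the test more conservative than the pooled version.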

2.3.3 New Question 3: What is the country (or countries) that has the highest ‘Population density (people per sq. km of land area)’ across all years? (i.e., which country has the highest average ranking in this category across each time point in the dataset?)

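# Keep only the columns needed for the population density ranking
# (select() rather than summarise(), which is meant to return one row per group).
Gapminder_summarize_PopDensity_df <- select(
  GapminderData,
  Year,
  `Population density (people per sq. km of land area)`,
  `Country Name`
)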

Country_name = Gapminder_summarize_PopDensity_df %>%
  dplyr::select(`Country Name`) %>% 
  dplyr::pull() %>%
  unique() 

MeanPopDensity_Country = sapply(X = Country_name,
       simplify = FALSE,
       FUN = function(country){
         
         Gapminder_summarize_PopDensity_df %>%
           dplyr::filter(`Country Name` == country) %>%
           dplyr::select(`Population density (people per sq. km of land area)`) %>%
           dplyr::pull() %>%
           mean() # NA if any year is missing for that country

}) %>% unlist()
MeanPopDensity_Country %>% ## Visualizing the mean population density per country with DT
  sort(., decreasing = TRUE) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  rename(
    Country = rowname,
    `Mean Population density (people per sq. km of land area)` = "."
  ) %>%
  datatable()

The datatable ranks countries by their mean population density across all time points in the dataset. Monaco and Macao SAR, China have the highest average population density over 1962-2007.
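The question can be read either as "highest mean density" or as "highest average rank per year". The sketch below computes both on a made-up table (`toy`, with hypothetical columns `Country`, `Year`, `density`); it is an illustration of the two readings, not the analysis used above. Note that `na.rm = TRUE` is an assumption here: the original code calls `mean()` without it.

```r
library(dplyr)

# Made-up densities; country B has a missing value in 1967.
toy <- tibble(
  Country = rep(c("A", "B", "C"), times = 2),
  Year    = rep(c(1962, 1967), each = 3),
  density = c(100, 500, 50, 120, NA, 60)
)

# Reading 1: mean density per country, ignoring missing years.
mean_density <- toy %>%
  group_by(Country) %>%
  summarise(mean_density = mean(density, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(mean_density))

# Reading 2: rank countries within each year (1 = densest), then average the ranks.
avg_rank <- toy %>%
  group_by(Year) %>%
  mutate(rank_in_year = min_rank(desc(density))) %>%
  group_by(Country) %>%
  summarise(avg_rank = mean(rank_in_year, na.rm = TRUE), .groups = "drop") %>%
  arrange(avg_rank)
```

The two readings can disagree when a country is extremely dense in a few years only, so it is worth stating which one a table reports.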

2.3.4 New Question 4: What country (or countries) has shown the greatest increase in ‘Life expectancy at birth, total (years)’ since 1962?

LifeEx_Country = vector(mode = "list")

GapminderCountry <- GapminderData %>% # all unique country names, used for iteration
  select(`Country Name`) %>%
  unique() %>%
  pull()

# Keep only the columns needed for the life expectancy analysis
# (select() rather than summarise(), which is meant to return one row per group).
Gapminder_LifeEx_df <- select(
  GapminderData,
  Year,
  `Country Name`,
  `Life expectancy at birth, total (years)`
)

PearsonCorrLifeEx_Country_NA <- GapminderCountry %>% # Finding countries that have NAs for Life expectancy
    sapply(.,
           USE.NAMES = TRUE,
           simplify = FALSE,
           function(country) {
      
      LifeEx_Country[[country]] = Gapminder_LifeEx_df %>%
          dplyr::filter(`Country Name` == country) %>%
          dplyr::select(`Life expectancy at birth, total (years)`) %>%
          dplyr::pull()
      
      LifeEx_Country[[country]] %>% is.na() %>% sum()
      

}) %>% unlist() 
PearsonCorrLifeEx_Country_NA[PearsonCorrLifeEx_Country_NA != 0]
##            American Samoa                   Andorra                   Bermuda 
##                        10                        10                         8 
##    British Virgin Islands            Cayman Islands                   Curacao 
##                        10                        10                         9 
##                  Dominica             Faroe Islands                 Gibraltar 
##                         5                         5                        10 
##                 Greenland               Isle of Man                    Kosovo 
##                         4                         9                         4 
##             Liechtenstein          Marshall Islands                    Monaco 
##                         7                         5                        10 
##                     Nauru  Northern Mariana Islands                     Palau 
##                        10                         4                        10 
##                San Marino                    Serbia                Seychelles 
##                         7                         7                         4 
## Sint Maarten (Dutch part)       St. Kitts and Nevis  St. Martin (French part) 
##                         2                         5                         4 
##  Turks and Caicos Islands                    Tuvalu 
##                        10                        10

These countries are removed because they do not have complete values for all years in the ‘Life expectancy at birth, total (years)’ column.

test <- "pearson"
rm_na <- "complete.obs"
LifeEx_Country = vector(mode = "list")
Year_Country = vector(mode = "list")

GapminderCountryLifeEx = PearsonCorrLifeEx_Country_NA[PearsonCorrLifeEx_Country_NA == 0] %>%
  names() #No NAs

PearsonCorrCountryLifeEx <- GapminderCountryLifeEx %>% # Make into a list by iterating through the countries
  sapply(.,
    USE.NAMES = TRUE,
    simplify = FALSE,
    function(country) {
      
      LifeEx_Country[[country]] = Gapminder_LifeEx_df %>%
          dplyr::filter(`Country Name` == country) %>%
          dplyr::select(`Life expectancy at birth, total (years)`) %>%
          dplyr::pull()
      
      Year_Country[[country]] = Gapminder_LifeEx_df %>%
          dplyr::filter(`Country Name` == country) %>%
          dplyr::select(Year) %>%
          dplyr::pull()
      
      cor(x =  LifeEx_Country[[country]],
          y =  Year_Country[[country]],
          method = test,
          use = rm_na)
        
      }) %>% unlist()
PearsonCorrCountryLifeEx %>% ## Visualizing the Pearson correlation coefficients, highest first, with DT
  sort(., decreasing = TRUE) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  rename(
    Country = rowname,
     `Pearson Correlation Coefficient` = "."
  ) %>%
  datatable()

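After removing countries with incomplete life expectancy entries, Maldives has the highest Pearson correlation coefficient between ‘Life expectancy at birth, total (years)’ and year, meaning its life expectancy rose most consistently over the period. Note that the correlation coefficient measures how steadily life expectancy increased rather than the size of the increase in years.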
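The difference between a steady increase and a large increase can be seen on a small made-up example. In this sketch the table `toy` and its columns are hypothetical: country A rises perfectly steadily by a few years, while country B gains far more but less evenly.

```r
library(dplyr)

# Made-up life expectancies for two hypothetical countries.
toy <- tibble(
  Country  = rep(c("A", "B"), each = 4),
  Year     = rep(c(1962, 1977, 1992, 2007), times = 2),
  life_exp = c(40, 42, 44, 46, 40, 60, 55, 70)
)

change <- toy %>%
  group_by(Country) %>%
  arrange(Year, .by_group = TRUE) %>%
  summarise(
    r_with_year = cor(Year, life_exp),              # steadiness of the trend
    gain_years  = last(life_exp) - first(life_exp), # size of the increase
    .groups = "drop"
  )
change
```

Country A has the higher correlation with year, yet country B has the larger gain in years, so the two measures can rank countries differently.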